A Software-Based Hardware Fault Tolerance Scheme for Multicomputers

Authors

  • Yuval Tamir
  • Eli Gafni
Abstract

A hardware fault tolerance scheme for large multicomputers executing time-consuming non-interactive applications is described. Error detection and recovery are done mostly by software with little hardware support. The scheme is based on simultaneous execution of identical copies of the application on two subnetworks of the system. Normal system operation is periodically suspended and the logical states of the two subnetworks are synchronized. Errors are detected by comparing the "frozen" synchronized states of the two subnetworks while they are being saved as "checkpoints" for possible subsequent use for error recovery. Algorithms for error detection and recovery using this scheme are discussed.
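The duplicate-and-compare idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' algorithm: the two subnetworks are modeled as two copies of a state advanced by a deterministic `step_fn`, and the names `run_duplicated`, `digest`, `faults`, and `corrupt` are hypothetical. The recovery policy (roll both copies back to the last matching checkpoint and re-execute) is an assumption, since the abstract does not spell out the recovery procedure.

```python
import copy
import hashlib
import pickle


def digest(state):
    """Hash a frozen state so the two copies can be compared cheaply."""
    return hashlib.sha256(pickle.dumps(state)).hexdigest()


def run_duplicated(step_fn, initial_state, total_steps, interval,
                   faults=None, corrupt=lambda s: s):
    """Run two copies of step_fn, comparing and checkpointing periodically.

    faults:  hypothetical set of (replica, step) pairs at which a transient
             error hits that replica once.
    corrupt: hypothetical function that models the effect of such an error.
    Returns (final_state, number_of_rollbacks).
    """
    pending = set(faults or ())
    checkpoint, checkpoint_step = copy.deepcopy(initial_state), 0
    replicas = [copy.deepcopy(initial_state), copy.deepcopy(initial_state)]
    step, rollbacks = 0, 0
    while step < total_steps:
        for r in (0, 1):
            replicas[r] = step_fn(replicas[r])
            if (r, step) in pending:
                pending.discard((r, step))  # transient: occurs only once
                replicas[r] = corrupt(replicas[r])
        step += 1
        # Periodically "freeze" both copies and compare their states.
        if step % interval == 0 or step == total_steps:
            if digest(replicas[0]) == digest(replicas[1]):
                # States agree: save them as the new checkpoint.
                checkpoint, checkpoint_step = copy.deepcopy(replicas[0]), step
            else:
                # Mismatch: assumed policy is to restore both copies.
                replicas = [copy.deepcopy(checkpoint), copy.deepcopy(checkpoint)]
                step = checkpoint_step
                rollbacks += 1
    return replicas[0], rollbacks


if __name__ == "__main__":
    final, rollbacks = run_duplicated(
        step_fn=lambda s: s + 1,
        initial_state=0,
        total_steps=10,
        interval=4,
        faults={(1, 3)},
        corrupt=lambda s: s + 1000,
    )
    print(final, rollbacks)
```

With only two copies, a mismatch shows that an error occurred but not which copy is wrong, which is why this sketch restores both copies from the checkpoint rather than trying to pick a winner.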

Similar resources

Fault-Tolerant Multicasting in Multistage Interconnection Networks

In this paper, we study fault-tolerant multicasting in multistage interconnection networks (MINs) for constructing large-scale multicomputers. In addition to point-to-point routing among processor nodes, efficient multicasting is critical to the performance of multicomputers. This paper presents a new approach to provide fault-tolerant multicasting, which employs the restricted header encoding...

Algorithm-Based Fault-Tolerant Strategies in Faulty Hypercube and Star

This dissertation addresses the design of algorithm-based fault-tolerant strategies in faulty hypercube and star graph multicomputers without hardware modification. Several new concepts and designs are presented here under the permanent and transient fault models. Under the permanent fault model, we propose a new fault-tolerant reconfiguration scheme in the faulty hypercube and star graph multico...

Fault Tolerance for Multiprocessor Systems Via Time Redundant Task Scheduling

Fault tolerance is often considered a useful additional feature for multiprocessor systems, but nowadays it is becoming an essential attribute. Fault tolerance can be achieved by the use of dedicated customized hardware, which may have the disadvantage of high cost. Another approach to fault tolerance is to exploit existing redundancy in multiprocessor systems via a task scheduling software stra...

Design and Analysis of Transient Fault Tolerance for Multi Core Architecture

This paper describes a software approach to fault tolerance for shared-memory multicore systems using PLR. PLR takes a software-centric approach to transient fault tolerance that ensures correct software execution. The scheme operates at user-space level and does not require changes to the original application. PLR creates a set of redundant processes per application process. In this scheme ...

High-Coverage Fault Tolerance in Real-Time Systems Based on Point-to-Point Communication

The distributed recovery block (DRB) scheme is a widely applicable approach for realizing both hardware and software fault tolerance in real-time distributed and parallel computer systems. One of the most important extensions of the DRB scheme that have been outlined in recent years but not fully developed is the integration of the DRB scheme with a network surveillance (NS) scheme. We recently deve...

Publication date: 1987